Introduction

About Me

To quote Nate Bargatze, “Hello World!”

As you can see above, my name is Kevin Schell. I am a practicing physical therapist with over twelve years of clinical experience working with adults admitted to a major academic medical center in Boston. I am a board-certified clinical specialist in geriatric physical therapy and have provided physical therapy care to adults across all specialties including critical care environments and the hospital’s emergency department. In fact, my time working in the emergency department ignited a new passion for me to take my clinical experience and (hopefully) positively impact the patient experience at a different level from one-on-one patient care. To this end, I am currently pursuing a Master of Science in Healthcare Data Analytics at the MGH Institute of Health Professions.

To supplement my learning, I have used different publicly available datasets to practice visualization techniques while looking for potential associations to further investigate with statistical analyses. Most recently, I utilized Employee Attrition from Healthcare from Kaggle. In particular, I worked through logistic regression modeling to identify independent risk factors associated with healthcare works leaving their jobs. Please feel free to peruse the rest of this site to see my efforts related to this project.

Dataset

The Employee Attrition from Healthcare dataset includes fabricated data. Overall, there are nearly 1,700 observations with 34 variables with near flawless data capture (i.e. minimal missing values and no duplicate observations). The primary outcome of interest for this analysis involves whether or not an individual left their job, captured in the variable Attrition. The potential independent variables include demographic information, work experiences, and quality of life measures. The author of the dataset referenced the US healthcare system. Given this, I am making some assumptions regarding units of measure. For example, I am assuming DistanceFromHome is captured in miles as opposed to kilometers, and compensation values are reported in US Dollars. I am also assuming that the self report measures follow typical 4-point Likert scales, however, this is never explicitly stated.

Addressing Variable Type and Levels

Several of the variables were interpreted as the wrong variable type after being read into R. For example, the main outcome of interest for this dataset Attrition was initially treated as a character/string variable. I converted this to a factor. This was a recurring issue across several variables. To reduce the volume of coding required, I converted the string/character variables to factor variables and then later updated ordinal variables with their respective levels respectively.

I also relabeled the JobRole level Admin to Administrative. This Admin level presented a bit of an issue as it is unclear whether it reflects a truly different level from Administrative versus represents an issue in how the information was collected (i.e. free text submission). I could have simply removed the limited number of observations with JobRole == Admin, but I decided to treat Admin as representing the same role group as Administrative.

if (!require(tidyverse)) install.packages("tidyverse")
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyverse)
if (!require(gt)) install.packages("gt")
## Loading required package: gt
library(gt)
if (!require(gtsummary)) install.packages("gtsummary")
## Loading required package: gtsummary
## #StandWithUkraine
library(gtsummary)
setwd("C:\\Users\\kjsch\\OneDrive\\Documents\\Schell Data Analytics Portfolio\\Spring 2024 Data Analytics Project")
att <- read.csv("watson_healthcare_modified.csv")
att$Attrition<-as.factor(att$Attrition)
att <- att %>%
  mutate(across(where(is.character), as.factor))
att$JobRole<-recode_factor(att$JobRole, "Admin" = "Administrative")
att$EnvironmentSatisfaction<-factor(att$EnvironmentSatisfaction, ordered  = TRUE, levels = c("1", "2", "3", "4"))
att$JobLevel<-factor(att$JobLevel, ordered  = TRUE, levels = c("1", "2", "3", "4"))
att$JobInvolvement<-factor(att$JobInvolvement, ordered = TRUE, levels = c("1", "2", "3", "4"))
att$JobSatisfaction<-factor(att$JobSatisfaction, ordered = TRUE, levels = c("1", "2", "3", "4"))
att$RelationshipSatisfaction<-factor(att$RelationshipSatisfaction, ordered = TRUE, level = c("1", "2", "3", "4"))
att$PerformanceRating<-factor(att$PerformanceRating, ordered = TRUE, levels = c("1", "2", "3", "4"))
att$BusinessTravel<-factor(att$BusinessTravel, ordered = TRUE, levels = c("Non-Travel", "Travel_Rarely", "Travel_Frequently"))
att$WorkLifeBalance<-factor(att$WorkLifeBalance, ordered = TRUE, levels = c("1", "2", "3", "4"))
att$Shift<-as.factor(att$Shift)

Addressing Missingness

Out of the 34 variables included in this dataset, only JobLevel contained missing values. There are 81 observations with “NA” as their value for JobLevel. It is possible that “NA” accurately reflects these observations’ JobLevel. In other words, there may be 81 respondents whose roles do not include ascending levels. However, it is also possible that some respondents may have inadvertently failed to respond to this question, or there may have been a transcription error when this dataset was generated from the raw data. Without knowing how this fabricated data was “collected,” it is impossible to know for certain.

Below you will find my initial analysis of missingness using the mice package. The analysis confirmed that only JobLevel had missing values. Summaries of datasets with all observations and one with observations removed for missing JobLevel are included below as well. Had the differences been more egregious, I would have considered imputation of JobLevel for those with missing levels. However, given the lack of clarity regarding the meaning of NA combined with only subtle differences and relatively small percentage of missing values, I decided I would remove the observations with missing values when specifying models.

if (!require(mice)) install.packages("mice")
## Loading required package: mice
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
library(mice)
md.pattern(att)

##      EmployeeID Age Attrition BusinessTravel DailyRate Department
## 1595          1   1         1              1         1          1
## 81            1   1         1              1         1          1
##               0   0         0              0         0          0
##      DistanceFromHome Education EducationField EmployeeCount
## 1595                1         1              1             1
## 81                  1         1              1             1
##                     0         0              0             0
##      EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobRole
## 1595                       1      1          1              1       1
## 81                         1      1          1              1       1
##                            0      0          0              0       0
##      JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate NumCompaniesWorked
## 1595               1             1             1           1                  1
## 81                 1             1             1           1                  1
##                    0             0             0           0                  0
##      Over18 OverTime PercentSalaryHike PerformanceRating
## 1595      1        1                 1                 1
## 81        1        1                 1                 1
##           0        0                 0                 0
##      RelationshipSatisfaction StandardHours Shift TotalWorkingYears
## 1595                        1             1     1                 1
## 81                          1             1     1                 1
##                             0             0     0                 0
##      TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## 1595                     1               1              1                  1
## 81                       1               1              1                  1
##                          0               0              0                  0
##      YearsSinceLastPromotion YearsWithCurrManager JobLevel   
## 1595                       1                    1        1  0
## 81                         1                    1        0  1
##                            0                    0       81 81

Descriptive Statistics of Dataset with Missing Values Included

att.continuous <- att %>%
select(Age, DailyRate, DistanceFromHome, Education, HourlyRate, MonthlyIncome, MonthlyRate, NumCompaniesWorked, PercentSalaryHike, StandardHours, TotalWorkingYears, YearsAtCompany, YearsInCurrentRole,YearsSinceLastPromotion, YearsWithCurrManager)
summary(att.continuous)
##       Age          DailyRate      DistanceFromHome   Education    
##  Min.   :18.00   Min.   : 102.0   Min.   : 1.000   Min.   :1.000  
##  1st Qu.:30.00   1st Qu.: 465.0   1st Qu.: 2.000   1st Qu.:2.000  
##  Median :36.00   Median : 796.5   Median : 7.000   Median :3.000  
##  Mean   :36.87   Mean   : 800.6   Mean   : 9.222   Mean   :2.908  
##  3rd Qu.:43.00   3rd Qu.:1157.0   3rd Qu.:14.000   3rd Qu.:4.000  
##  Max.   :60.00   Max.   :1499.0   Max.   :29.000   Max.   :5.000  
##    HourlyRate     MonthlyIncome    MonthlyRate    NumCompaniesWorked
##  Min.   : 30.00   Min.   : 1009   Min.   : 2094   Min.   :0.000     
##  1st Qu.: 48.00   1st Qu.: 2928   1st Qu.: 7993   1st Qu.:1.000     
##  Median : 65.50   Median : 4899   Median :14270   Median :2.000     
##  Mean   : 65.47   Mean   : 6517   Mean   :14287   Mean   :2.662     
##  3rd Qu.: 83.00   3rd Qu.: 8380   3rd Qu.:20462   3rd Qu.:4.000     
##  Max.   :100.00   Max.   :19999   Max.   :26999   Max.   :9.000     
##  PercentSalaryHike StandardHours TotalWorkingYears YearsAtCompany  
##  Min.   :11.0      Min.   :80    Min.   : 0.00     Min.   : 0.000  
##  1st Qu.:12.0      1st Qu.:80    1st Qu.: 6.00     1st Qu.: 3.000  
##  Median :14.0      Median :80    Median :10.00     Median : 5.000  
##  Mean   :15.2      Mean   :80    Mean   :11.34     Mean   : 7.033  
##  3rd Qu.:18.0      3rd Qu.:80    3rd Qu.:15.00     3rd Qu.:10.000  
##  Max.   :25.0      Max.   :80    Max.   :40.00     Max.   :40.000  
##  YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
##  Min.   : 0.000     Min.   : 0.0            Min.   : 0.000      
##  1st Qu.: 2.000     1st Qu.: 0.0            1st Qu.: 2.000      
##  Median : 3.000     Median : 1.0            Median : 3.000      
##  Mean   : 4.265     Mean   : 2.2            Mean   : 4.135      
##  3rd Qu.: 7.000     3rd Qu.: 3.0            3rd Qu.: 7.000      
##  Max.   :18.000     Max.   :15.0            Max.   :17.000
att.nomord<-att%>%
  select(!c(EmployeeID, Age, DailyRate, DistanceFromHome, Education, HourlyRate, MonthlyIncome, MonthlyRate, NumCompaniesWorked, PercentSalaryHike, StandardHours, TotalWorkingYears, TrainingTimesLastYear, YearsAtCompany, YearsInCurrentRole,YearsSinceLastPromotion, YearsWithCurrManager))
summary(att.nomord)
##  Attrition            BusinessTravel      Department           EducationField
##  No :1477   Non-Travel       : 172   Cardiology:531   Human Resources : 29   
##  Yes: 199   Travel_Rarely    :1184   Maternity :796   Life Sciences   :697   
##             Travel_Frequently: 320   Neurology :349   Marketing       :189   
##                                                       Medical         :524   
##                                                       Other           : 88   
##                                                       Technical Degree:149   
##  EmployeeCount EnvironmentSatisfaction    Gender    JobInvolvement JobLevel  
##  Min.   :1     1:330                   Female:678   1: 96          1   :621  
##  1st Qu.:1     2:326                   Male  :998   2:433          2   :606  
##  Median :1     3:512                                3:983          3   :246  
##  Mean   :1     4:508                                4:164          4   :122  
##  3rd Qu.:1                                                         NA's: 81  
##  Max.   :1                                                                   
##            JobRole    JobSatisfaction  MaritalStatus Over18   OverTime  
##  Administrative:131   1:329           Divorced:377   Y:1676   No :1200  
##  Nurse         :822   2:310           Married :777            Yes: 476  
##  Other         :534   3:507           Single  :522                      
##  Therapist     :189   4:530                                             
##                                                                         
##                                                                         
##  PerformanceRating RelationshipSatisfaction Shift   WorkLifeBalance
##  1:   0            1:310                    0:708   1:  90         
##  2:   0            2:346                    1:684   2: 385         
##  3:1424            3:526                    2:185   3:1028         
##  4: 252            4:494                    3: 99   4: 173         
##                                                                    
## 

Attrition by Job Level with NA Included

ggplot(data = att, aes(x = JobLevel, fill = Attrition)) +
  geom_bar(color = "dimgrey") +
  theme_test() +
  scale_fill_manual(values = c("red4", "green4")) +
  labs(
    title = "Attrition by Job Level",
    subtitle = ""
  )

addmargins(table(att$JobLevel,att$Attrition))
##      
##         No  Yes  Sum
##   1    479  142  621
##   2    572   34  606
##   3    228   18  246
##   4    119    3  122
##   Sum 1398  197 1595

Descriptive Statistics of Dataset with Observations with Missing Values Removed

att.na <- na.omit(att)
att.na.continuous <- att.na %>%
select(Age, DailyRate, DistanceFromHome, Education, HourlyRate, MonthlyIncome, MonthlyRate, NumCompaniesWorked, PercentSalaryHike, StandardHours, TotalWorkingYears, YearsAtCompany, YearsInCurrentRole,YearsSinceLastPromotion, YearsWithCurrManager)
summary(att.na.continuous)
##       Age          DailyRate      DistanceFromHome   Education    
##  Min.   :18.00   Min.   : 102.0   Min.   : 1.000   Min.   :1.000  
##  1st Qu.:30.00   1st Qu.: 465.0   1st Qu.: 2.000   1st Qu.:2.000  
##  Median :35.00   Median : 796.0   Median : 7.000   Median :3.000  
##  Mean   :36.31   Mean   : 799.6   Mean   : 9.389   Mean   :2.905  
##  3rd Qu.:42.00   3rd Qu.:1154.0   3rd Qu.:15.000   3rd Qu.:4.000  
##  Max.   :60.00   Max.   :1499.0   Max.   :29.000   Max.   :5.000  
##    HourlyRate     MonthlyIncome    MonthlyRate    NumCompaniesWorked
##  Min.   : 30.00   Min.   : 1009   Min.   : 2094   Min.   :0.000     
##  1st Qu.: 48.00   1st Qu.: 2860   1st Qu.: 7932   1st Qu.:1.000     
##  Median : 66.00   Median : 4724   Median :14284   Median :1.000     
##  Mean   : 65.44   Mean   : 5874   Mean   :14290   Mean   :2.622     
##  3rd Qu.: 83.00   3rd Qu.: 7365   3rd Qu.:20469   3rd Qu.:4.000     
##  Max.   :100.00   Max.   :17924   Max.   :26999   Max.   :9.000     
##  PercentSalaryHike StandardHours TotalWorkingYears YearsAtCompany  
##  Min.   :11.00     Min.   :80    Min.   : 0.00     Min.   : 0.000  
##  1st Qu.:12.00     1st Qu.:80    1st Qu.: 6.00     1st Qu.: 3.000  
##  Median :14.00     Median :80    Median : 9.00     Median : 5.000  
##  Mean   :15.25     Mean   :80    Mean   :10.59     Mean   : 6.643  
##  3rd Qu.:18.00     3rd Qu.:80    3rd Qu.:14.00     3rd Qu.: 9.000  
##  Max.   :25.00     Max.   :80    Max.   :40.00     Max.   :40.000  
##  YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
##  Min.   : 0.000     Min.   : 0.000          Min.   : 0.000      
##  1st Qu.: 2.000     1st Qu.: 0.000          1st Qu.: 2.000      
##  Median : 3.000     Median : 1.000          Median : 3.000      
##  Mean   : 4.153     Mean   : 2.092          Mean   : 4.003      
##  3rd Qu.: 7.000     3rd Qu.: 2.000          3rd Qu.: 7.000      
##  Max.   :17.000     Max.   :15.000          Max.   :17.000
att.na.nomord<-att.na%>%
  select(!c(EmployeeID, Age, DailyRate, DistanceFromHome, Education, HourlyRate, MonthlyIncome, MonthlyRate, NumCompaniesWorked, PercentSalaryHike, StandardHours, TotalWorkingYears, TrainingTimesLastYear, YearsAtCompany, YearsInCurrentRole,YearsSinceLastPromotion, YearsWithCurrManager))
summary(att.na.nomord)
##  Attrition            BusinessTravel      Department           EducationField
##  No :1398   Non-Travel       : 168   Cardiology:515   Human Resources : 24   
##  Yes: 197   Travel_Rarely    :1119   Maternity :777   Life Sciences   :666   
##             Travel_Frequently: 308   Neurology :303   Marketing       :180   
##                                                       Medical         :495   
##                                                       Other           : 84   
##                                                       Technical Degree:146   
##  EmployeeCount EnvironmentSatisfaction    Gender    JobInvolvement JobLevel
##  Min.   :1     1:313                   Female:649   1: 90          1:621   
##  1st Qu.:1     2:311                   Male  :946   2:413          2:606   
##  Median :1     3:492                                3:932          3:246   
##  Mean   :1     4:479                                4:160          4:122   
##  3rd Qu.:1                                                                 
##  Max.   :1                                                                 
##            JobRole    JobSatisfaction  MaritalStatus Over18   OverTime  
##  Administrative: 83   1:316           Divorced:357   Y:1595   No :1140  
##  Nurse         :822   2:293           Married :735            Yes: 455  
##  Other         :506   3:476           Single  :503                      
##  Therapist     :184   4:510                                             
##                                                                         
##                                                                         
##  PerformanceRating RelationshipSatisfaction Shift   WorkLifeBalance
##  1:   0            1:298                    0:678   1: 89          
##  2:   0            2:332                    1:639   2:367          
##  3:1351            3:499                    2:181   3:975          
##  4: 244            4:466                    3: 97   4:164          
##                                                                    
## 
att<-att.na

Descriptive Statistics and Visualizations

Descriptive Statistics

Continuous Variables

Based on the basic descriptive statistics, the sample at large reflects a relatively young group of workers that skews older that live relatively close to home and earn an average hourly rate well above the average for US healthcare workers (US Bureau of Labor Statistics, 2023). Additionally, most participants have worked at only one company, however, the sample is skewed greatly by those who have worked for several organizations. Interestingly, most participants have been promoted within the last year.

att.continuous <- att %>%
select(Age, Attrition, DailyRate, DistanceFromHome, HourlyRate, MonthlyIncome, MonthlyRate, NumCompaniesWorked, PercentSalaryHike, TotalWorkingYears, YearsAtCompany, YearsInCurrentRole,YearsSinceLastPromotion, YearsWithCurrManager)
att.continuous %>%
  select(!Attrition) %>%
  tbl_summary(
    statistic = list(
      all_continuous() ~ "{mean} ({sd})")
    )
Characteristic N = 1,5951
Age 36 (9)
DailyRate 800 (401)
DistanceFromHome 9 (8)
HourlyRate 65 (20)
MonthlyIncome 5,874 (3,866)
MonthlyRate 14,290 (7,153)
NumCompaniesWorked 3 (2)
PercentSalaryHike 15 (4)
TotalWorkingYears 11 (7)
YearsAtCompany 7 (6)
YearsInCurrentRole 4 (3)
YearsSinceLastPromotion 2 (3)
YearsWithCurrManager 4 (3)
1 Mean (SD)
att.continuous %>%
  tbl_summary(
    by = Attrition,
    statistic = list(
      all_continuous() ~ "{mean} ({sd})")
  ) %>% add_p()
Characteristic No, N = 1,3981 Yes, N = 1971 p-value2
Age 37 (9) 31 (9) <0.001
DailyRate 807 (402) 744 (394) 0.034
DistanceFromHome 9 (8) 12 (9) <0.001
HourlyRate 66 (20) 64 (19) 0.2
MonthlyIncome 6,157 (3,925) 3,867 (2,675) <0.001
MonthlyRate 14,177 (7,131) 15,092 (7,272) 0.10
NumCompaniesWorked 3 (2) 3 (3) >0.9
PercentSalaryHike 15 (4) 15 (4) 0.7
TotalWorkingYears 11 (7) 6 (5) <0.001
YearsAtCompany 7 (6) 4 (4) <0.001
YearsInCurrentRole 4 (4) 2 (2) <0.001
YearsSinceLastPromotion 2 (3) 1 (2) <0.001
YearsWithCurrManager 4 (3) 2 (3) <0.001
1 Mean (SD)
2 Wilcoxon rank sum test

Nominal and Ordinal Variables

The sample is imbalanced across the outcome of interest, Attrition. Additionally, male gender is grossly overrepresented. Nurse is the most common role identified, and most participants are single and early in their careers. Interestingly, most participants rate their WorkLifeBalance positively.

att.nomord<-att%>%
  select(!c(EmployeeID, Age, DailyRate, DistanceFromHome, HourlyRate, MonthlyIncome, MonthlyRate, NumCompaniesWorked, PercentSalaryHike, StandardHours, TotalWorkingYears, TrainingTimesLastYear, YearsAtCompany, YearsInCurrentRole,YearsSinceLastPromotion, YearsWithCurrManager))
tbl_summary(att.nomord)
Characteristic N = 1,5951
Attrition 197 (12%)
BusinessTravel
    Non-Travel 168 (11%)
    Travel_Rarely 1,119 (70%)
    Travel_Frequently 308 (19%)
Department
    Cardiology 515 (32%)
    Maternity 777 (49%)
    Neurology 303 (19%)
Education
    1 189 (12%)
    2 306 (19%)
    3 622 (39%)
    4 424 (27%)
    5 54 (3.4%)
EducationField
    Human Resources 24 (1.5%)
    Life Sciences 666 (42%)
    Marketing 180 (11%)
    Medical 495 (31%)
    Other 84 (5.3%)
    Technical Degree 146 (9.2%)
EmployeeCount 1,595 (100%)
EnvironmentSatisfaction
    1 313 (20%)
    2 311 (19%)
    3 492 (31%)
    4 479 (30%)
Gender
    Female 649 (41%)
    Male 946 (59%)
JobInvolvement
    1 90 (5.6%)
    2 413 (26%)
    3 932 (58%)
    4 160 (10%)
JobLevel
    1 621 (39%)
    2 606 (38%)
    3 246 (15%)
    4 122 (7.6%)
JobRole
    Administrative 83 (5.2%)
    Nurse 822 (52%)
    Other 506 (32%)
    Therapist 184 (12%)
JobSatisfaction
    1 316 (20%)
    2 293 (18%)
    3 476 (30%)
    4 510 (32%)
MaritalStatus
    Divorced 357 (22%)
    Married 735 (46%)
    Single 503 (32%)
Over18
    Y 1,595 (100%)
OverTime 455 (29%)
PerformanceRating
    1 0 (0%)
    2 0 (0%)
    3 1,351 (85%)
    4 244 (15%)
RelationshipSatisfaction
    1 298 (19%)
    2 332 (21%)
    3 499 (31%)
    4 466 (29%)
Shift
    0 678 (43%)
    1 639 (40%)
    2 181 (11%)
    3 97 (6.1%)
WorkLifeBalance
    1 89 (5.6%)
    2 367 (23%)
    3 975 (61%)
    4 164 (10%)
1 n (%)
att.nomord %>%
  tbl_summary(
    by = Attrition
  )
Characteristic No, N = 1,3981 Yes, N = 1971
BusinessTravel

    Non-Travel 152 (11%) 16 (8.1%)
    Travel_Rarely 995 (71%) 124 (63%)
    Travel_Frequently 251 (18%) 57 (29%)
Department

    Cardiology 441 (32%) 74 (38%)
    Maternity 680 (49%) 97 (49%)
    Neurology 277 (20%) 26 (13%)
Education

    1 162 (12%) 27 (14%)
    2 271 (19%) 35 (18%)
    3 534 (38%) 88 (45%)
    4 378 (27%) 46 (23%)
    5 53 (3.8%) 1 (0.5%)
EducationField

    Human Resources 18 (1.3%) 6 (3.0%)
    Life Sciences 583 (42%) 83 (42%)
    Marketing 152 (11%) 28 (14%)
    Medical 445 (32%) 50 (25%)
    Other 76 (5.4%) 8 (4.1%)
    Technical Degree 124 (8.9%) 22 (11%)
EmployeeCount 1,398 (100%) 197 (100%)
EnvironmentSatisfaction

    1 249 (18%) 64 (32%)
    2 273 (20%) 38 (19%)
    3 444 (32%) 48 (24%)
    4 432 (31%) 47 (24%)
Gender

    Female 563 (40%) 86 (44%)
    Male 835 (60%) 111 (56%)
JobInvolvement

    1 61 (4.4%) 29 (15%)
    2 342 (24%) 71 (36%)
    3 845 (60%) 87 (44%)
    4 150 (11%) 10 (5.1%)
JobLevel

    1 479 (34%) 142 (72%)
    2 572 (41%) 34 (17%)
    3 228 (16%) 18 (9.1%)
    4 119 (8.5%) 3 (1.5%)
JobRole

    Administrative 83 (5.9%) 0 (0%)
    Nurse 715 (51%) 107 (54%)
    Other 420 (30%) 86 (44%)
    Therapist 180 (13%) 4 (2.0%)
JobSatisfaction

    1 265 (19%) 51 (26%)
    2 252 (18%) 41 (21%)
    3 416 (30%) 60 (30%)
    4 465 (33%) 45 (23%)
MaritalStatus

    Divorced 333 (24%) 24 (12%)
    Married 675 (48%) 60 (30%)
    Single 390 (28%) 113 (57%)
Over18

    Y 1,398 (100%) 197 (100%)
OverTime 318 (23%) 137 (70%)
PerformanceRating

    1 0 (0%) 0 (0%)
    2 0 (0%) 0 (0%)
    3 1,186 (85%) 165 (84%)
    4 212 (15%) 32 (16%)
RelationshipSatisfaction

    1 258 (18%) 40 (20%)
    2 292 (21%) 40 (20%)
    3 432 (31%) 67 (34%)
    4 416 (30%) 50 (25%)
Shift

    0 539 (39%) 139 (71%)
    1 598 (43%) 41 (21%)
    2 175 (13%) 6 (3.0%)
    3 86 (6.2%) 11 (5.6%)
WorkLifeBalance

    1 65 (4.6%) 24 (12%)
    2 313 (22%) 54 (27%)
    3 877 (63%) 98 (50%)
    4 143 (10%) 21 (11%)
1 n (%)

Demographics

Please see each visualization’s subtitle for takeaway messages

Age, All Roles

ggplot(data = att, aes(Age)) +
  geom_histogram(binwidth = 2, color = "grey65",fill = "hotpink4") +
  geom_vline(aes(xintercept = median(Age)), color = "black", linewidth = 1.25) +
  theme_test()+
  labs(
    title = "Age of Respondents",
    subtitle = "Skews Older",
    x = "Age in Years",
    y = "Count of Respondents",
    caption = "Median Represented with Line"
  )

ggplot(data = att, aes(x = "", y=Age)) +
  geom_violin(color = "hotpink4") +
  geom_jitter(shape = 21, alpha = 0.25, color = "hotpink4") +
  theme_test() +
  theme(
    axis.ticks.x = element_blank(),
    axis.text.x = element_blank()
  ) +
  geom_boxplot(width = 0.15, color = "hotpink4", show.legend = FALSE) +
  labs(
    title = "Age of Respondents",
    subtitle = "Older Skew Apparent in Shape",
    x = "All Respondents",
    y = "Age in Years"
  )

Job Role

ggplot(data = att, aes(x = JobRole, fill = JobRole)) +
  geom_bar(color = "dimgrey", show.legend = FALSE) +
  theme_test() +
  scale_fill_manual(values = c("dodgerblue4", "hotpink4", "lightyellow4", "seagreen4")) +
  labs(
    title = "JobRole All Respondents",
    subtitle = "Nursing Role is Most Represented"
  )

Age by Job Role

ggplot(data = att, aes(Age, fill = JobRole)) +
  geom_histogram(binwidth = 2, color = "dimgrey", show.legend = FALSE) +
  facet_wrap(~att$JobRole) +
  theme_test()+
  scale_fill_manual(values = c("dodgerblue4", "hotpink4", "lightyellow4", "seagreen4")) +
  labs(
    title = "Age of Respondents by JobRole",
    subtitle = "Admin Relatively Older, Nurses with Narrower Distribution",
    x = "Age in Years",
    y = "Count of Respondents"
  )

ggplot(data = att, aes(x = JobRole, y = Age, color = JobRole)) +
  geom_violin(show.legend = FALSE) +
  geom_jitter(show.legend = FALSE, shape = 20, alpha = 0.33) +
  theme_test() +
  geom_boxplot(width = 0.15, show.legend = FALSE) +
  labs(
    title = "Age by JobRole",
    subtitle = "Admin/Administrative Relatively Older"
  ) +
  scale_color_manual(values = c("dodgerblue4", "hotpink4", "lightyellow4", "seagreen4"))

Gender

ggplot(data = att, aes(x = Gender, fill = Gender)) +
  geom_bar(show.legend = FALSE) +
  scale_fill_manual(values = c("hotpink4", "dodgerblue4")) +
  theme_test() +
  labs(
    title = "Gender", 
    subtitle = "Male Gender Overrepresented",
    y = "Count of Respondents"
  )

Gender by Job Role

ggplot(data = att, aes(x = JobRole, fill = Gender)) +
  geom_bar(color = "dimgrey") +
  theme_test() +
  scale_fill_manual(values = c("hotpink4", "dodgerblue4")) +
  labs(
    title = "Gender by JobRole",
    subtitle = "Male Gender Overrepresented in Nursing and Other Groups"
  )

Marital Status

ggplot(data = att, aes(x = MaritalStatus, fill = Gender)) +
  geom_bar(color = "dimgrey") +
  theme_test() +
  labs(
    title = "Marital Status All Respondents",
    subtitle = "Slightly Less Than Half are Currently Married"
  ) +
  scale_fill_manual(values = c("hotpink4", "dodgerblue4"))

ggplot(data = att, aes(x = MaritalStatus, fill = Gender)) +
  geom_bar(color = "dimgrey") +
  theme_test() +
  facet_wrap(~JobRole) +
  scale_fill_manual(values = c("hotpink4", "dodgerblue4")) +
  labs(
    title = "Marital Status by JobRole",
    subtitle = "Nurses and Other Group Greater Percent Currently Single"
  )

Work Experience

Please see each visualization’s subtitle for takeaway messages

Travel Distance, Home to Workplace

ggplot(data = att, aes(x = DistanceFromHome)) +
  geom_histogram(binwidth = 2, color = "dimgrey",fill = "grey65") +
  geom_vline(aes(xintercept = mean(DistanceFromHome)), color = "black", linewidth = 1.25) +
  theme_test() +
  scale_x_continuous(breaks = round(seq(min(att$DistanceFromHome), max(att$DistanceFromHome), by = 3),1)) +
  labs(
    title = "Travel Distance, Home to Workplace",
    subtitle = "Most Live Close to Workplace, Skews Farther Away ",
    caption = "Median Represented by Line"
  )

ggplot(data = att, aes(x = JobRole, y = DistanceFromHome, color = JobRole)) +
  geom_violin(show.legend = FALSE) +
  geom_jitter(show.legend = FALSE, shape = 20, alpha = 0.33) +
  theme_test() +
    scale_y_continuous(breaks = round(seq(min(att$DistanceFromHome), max(att$DistanceFromHome), by = 2),1)) +
  scale_color_manual(values = c("dodgerblue4", "hotpink4", "lightyellow4", "seagreen4")) +
  geom_boxplot(width = 0.15, show.legend = FALSE) +
  labs(
    title = "Distance from Home by JobRole",
    subtitle = "Relatively Similar Across Role Groups"
  )

Business Travel Requirements

ggplot(data = att, aes(x = BusinessTravel, fill= BusinessTravel)) +
  geom_bar(color = "dimgrey") +
  theme_test() +
  scale_fill_manual(values = c("cyan3", "cadetblue3", "cadetblue")) +
  labs(
    title = "Count of Respondents by Business Travel",
    subtitle = "Most Respondents Rarely Travel"
  )

ggplot(data = att, aes(x = BusinessTravel, fill = BusinessTravel)) +
  geom_bar(color = "dimgrey", show.legend = FALSE) +
  scale_fill_manual(values = c("cyan3", "cadetblue3", "cadetblue")) +
  theme_test() +
  facet_wrap(~JobRole) +
  labs(
    title = "Similar Work Travel Experiences across Roles",
    subtitle = "Limited no. of Respondents Required to Travel for Work"
  )

HourlyRate

ggplot(data = att, aes(x = HourlyRate)) +
  geom_histogram(binwidth = 5, color = "dimgrey",fill = "darkolivegreen4") +
  geom_vline(aes(xintercept = mean(HourlyRate)), color = "black", linewidth = 1.25) +
  theme_test() +
  scale_x_continuous(breaks = round(seq(min(att$HourlyRate), max(att$HourlyRate), by = 5),1)) +
  labs(
    title = "Hourly Wage Equivalent, All Respondents",
    subtitle = "Bimodal Distribution"
  )

ggplot(data = att, aes(x = JobRole, y = HourlyRate, color = JobRole)) +
  geom_violin(show.legend = FALSE) +
  geom_jitter(show.legend = FALSE, shape = 20, alpha = 0.33) +
  geom_boxplot(width = 0.15, show.legend = FALSE) +
  theme_test() +
  scale_color_manual(values = c("dodgerblue4", "hotpink4", "lightyellow4", "seagreen4")) +
  labs(
    title = "Hourly Wage Equivalent, By JobRole",
    subtitle = "Similar Wages across Role Groups"
  )

Percent Salary Hike

ggplot(data = att, aes(x = PercentSalaryHike)) +
  geom_histogram(binwidth = 1, color = "dimgrey",fill = "darkolivegreen4") +
  geom_vline(aes(xintercept = mean(PercentSalaryHike)), color = "black", linewidth = 1.25) +
  theme_test() +
  scale_x_continuous(breaks = round(seq(min(att$PercentSalaryHike), max(att$PercentSalaryHike), by = 1),1)) +
  labs(
    title = "Percent Salary Hike, All Respondents",
    subtitle = "Marked Skew Toward Higher Percent Salary Changes"
  )

ggplot(data = att, aes(x = JobRole, y = PercentSalaryHike, color = JobRole)) +
  geom_violin(show.legend = FALSE) +
  geom_jitter(show.legend = FALSE, shape = 20, alpha = 0.33) +
  geom_boxplot(width = 0.15, show.legend = FALSE) +
  theme_test() +
  scale_color_manual(values = c("dodgerblue4", "hotpink4", "lightyellow4", "seagreen4")) +
  labs(
    title = "Percent Salary Hike By JobRole",
    subtitle = "Median Therapist Percent Salary Change Higher, Otherwise Similar across Roles"
  )

Over Time vs Not

ggplot(data = att, aes(x = OverTime, fill = OverTime)) +
  geom_bar(color = "dimgrey")+
  theme_test()+
  labs(
    title = "Over Time Status - All Participants",
    subtitle = "Less Thank One Third Participate in Over Time"
  ) +
  scale_fill_manual(values = c("red4", "green4"))

ggplot(data = att, aes(x = OverTime, fill = OverTime)) +
  geom_bar(color = "dimgrey") +
  theme_test() +
  facet_wrap(~JobRole) +
  labs(
    title = "Over Time Participation by JobRole", 
  subtitle = "Other Role Group Appears to Have Higher Percentage of Over Time Participation") +
  scale_fill_manual(values = c("red4", "green4"))

Total Career Length

ggplot(data = att, aes(x = TotalWorkingYears)) +
  geom_histogram(binwidth = 2, color = "dimgrey",fill = "darkslategray") +
  geom_vline(aes(xintercept = median(TotalWorkingYears)), color = "black", linewidth = 1.25) +
  theme_test() +
  scale_x_continuous(breaks = round(seq(min(att$TotalWorkingYears), max(att$TotalWorkingYears), by = 2),1)) +
  labs(
    title = "Total Career Length, All Respondents",
    subtitle = "Marked Drop around 12 Years & Heavily Skewed toward Long Careers",
    caption = "Median Represented with Line"
  )

ggplot(data = att, aes(x = JobRole, y = TotalWorkingYears, color = JobRole)) +
  geom_violin(show.legend = FALSE) +
  geom_jitter(show.legend = FALSE, shape = 20, alpha = 0.33) +
  geom_boxplot(width = 0.15, show.legend = FALSE) +
  theme_test() +
    scale_y_continuous(breaks = round(seq(min(att$TotalWorkingYears), max(att$TotalWorkingYears), by = 5),1)) +
  scale_color_manual(values = c("dodgerblue4", "hotpink4", "lightyellow4", "seagreen4")) +
  labs(
    title = "Total Career Length By JobRole",
    subtitle = "Administrative Respondents with Longer Careers, All Roles Skew Right"
  )

Years at Current Company

ggplot(data = att, aes (x = YearsAtCompany)) +
  geom_histogram(binwidth = 2, color = "dimgrey", fill = "blue4") +
  geom_vline(xintercept = mean(att$YearsAtCompany, color = "black", linewidth = 1.25 )) +
  scale_x_continuous(breaks = round(seq(min(att$YearsAtCompany), max(att$YearsAtCompany), by = 2),1)) +
  theme_test() +
  labs(
    title = "Years at Current Company",
    subtitle = "Most Respondents with Less Than 10 Years at Company",
    caption = "Median Represented with Line"
  )

ggplot(data = att, aes(x = JobRole, y = YearsAtCompany, color = JobRole)) +
  geom_violin(show.legend = FALSE) +
  geom_jitter(show.legend = FALSE, shape = 20, alpha = 0.33) +
  geom_boxplot(width = 0.15, show.legend = FALSE) +
  theme_test() +
    scale_y_continuous(breaks = round(seq(min(att$YearsAtCompany), max(att$YearsAtCompany), by = 5),1)) +
  scale_color_manual(values = c("dodgerblue4", "hotpink4", "lightyellow4", "seagreen4")) +
  labs(
    title = "Years at Current Company by JobRole",
    subtitle = "Administrative Respondends with More Years at Company"
  )

Years in Current Role

ggplot(data = att, aes (x = YearsInCurrentRole)) +
  geom_histogram(binwidth = 1, color = "dimgrey", fill = "blue4") +
  geom_vline(xintercept = median(att$YearsInCurrentRole, color = "black", linewidth = 1.25 )) +
  scale_x_continuous(breaks = round(seq(min(att$YearsInCurrentRole), max(att$YearsInCurrentRole), by = 2),1)) +
  theme_test() +
  labs(
    title = "Years in Current Role",
    subtitle = "Multimodal Distribution",
    caption = "Median Represented with Line"
  )

ggplot(data = att, aes(x = JobRole, y = YearsInCurrentRole, color = JobRole)) +
  geom_violin(show.legend = FALSE) +
  geom_jitter(show.legend = FALSE, shape = 20, alpha = 0.33) +
  geom_boxplot(width = 0.15, show.legend = FALSE) +
  theme_test() +
   scale_y_continuous(breaks = round(seq(min(att$YearsInCurrentRole), max(att$YearsInCurrentRole), by = 5),1)) +
  scale_color_manual(values = c("dodgerblue4", "hotpink4", "lightyellow4", "seagreen4")) +
  labs(
    title = "Years in Current Role by JobRole",
    subtitle = "Nurse and Other with Greater Skew toward Longer Time in Role"
  )

Years Since Last Promotion

ggplot(data = att, aes (x = YearsSinceLastPromotion)) +
  geom_histogram(binwidth = 1, color = "dimgrey", fill = "azure3") +
  geom_vline(xintercept = median(att$YearsSinceLastPromotion, color = "black", linewidth = 1.25 )) +
  scale_x_continuous(breaks = round(seq(min(att$YearsSinceLastPromotion), max(att$YearsSinceLastPromotion), by = 2),1)) +
  theme_test() +
  labs(
    title = "Years Since Last Promotion",
    subtitle = "Most Respondents with Promotion in Last 2 Years",
    caption = "Median Represented by Line"
  )

ggplot(data = att, aes(x = JobRole, y = YearsSinceLastPromotion, color = JobRole)) +
  geom_violin(show.legend = FALSE) +
  geom_jitter(show.legend = FALSE, shape = 20, alpha = 0.33) +
  geom_boxplot(width = 0.15, show.legend = FALSE) +
  theme_test() +
   scale_y_continuous(breaks = round(seq(min(att$YearsSinceLastPromotion), max(att$YearsSinceLastPromotion), by = 5),1)) +
  scale_color_manual(values = c("dodgerblue4", "hotpink4", "lightyellow4", "seagreen4")) +
  labs(
    title = "Years Since Last Promotion by JobRole",
    subtitle = "Administrative Group with Greatest Variability"
  )

Job Level

ggplot(data = att, aes(x = JobLevel, fill = JobLevel)) +
  geom_bar(color = "dimgrey") +
  theme_test() +
  scale_fill_manual(values = c("azure", "azure2", "azure3", "azure4", "black")) +
  labs(
    title = "Job Level All Respondents",
    subtitle = "Greater Number of Respondents with Lower Job Levels"
  )

Quality of Life Measures

Please see each visualization’s subtitle for takeaway messages

Job Satisfaction

ggplot(data = att, aes(x = JobSatisfaction, fill = JobSatisfaction)) +
  geom_bar(color = "dimgrey", show.legend = FALSE) +
  scale_fill_manual(values = c("azure", "azure2", "azure3", "azure4")) +
  theme_test() +
  labs(
    title = "Job Satisfaction All Respondents",
    subtitle = "Generally More Satisfied Than Not"
  )

ggplot(data = att, aes(x = JobSatisfaction, fill = JobSatisfaction)) +
  geom_bar(color = "dimgrey") +
  scale_fill_manual(values = c("azure", "azure2", "azure3", "azure4")) +
  facet_wrap(~JobRole) +
  theme_test() +
  labs(
    title = "Job Satisfaction by JobRole",
    subtitle = "Nurses Relatively More Satisfied"
  )

Environment Satisfaction

ggplot(data = att, aes(x = EnvironmentSatisfaction, fill = EnvironmentSatisfaction)) +
  geom_bar(color = "dimgrey", show.legend = FALSE) +
  scale_fill_manual(values = c("azure", "azure2", "azure3", "azure4")) +
  theme_test() +
  labs(
    title = "Environment Satisfaction All Respondents",
    subtitle = "Generally More Favorable Than Not"
  )

ggplot(data = att, aes(x = EnvironmentSatisfaction, fill = EnvironmentSatisfaction)) +
  geom_bar(color = "dimgrey") +
  scale_fill_manual(values = c("azure", "azure2", "azure3", "azure4")) +
  facet_wrap(~JobRole) +
  theme_test() +
  labs(
    title = "Environment Satsifaction by JobRole",
    subtitle = "Nurses Relatively More Satsified as Compared to Other Roles")

Work Life Balance

ggplot(data = att, aes(x = WorkLifeBalance, fill = WorkLifeBalance)) +
  geom_bar(color = "dimgrey", show.legend = FALSE) +
  scale_fill_manual(values = c("azure", "azure2", "azure3", "azure4")) +
  theme_test() +
  labs(
    title = "Work Life Balance All Respondents",
    subtilte = "Generally More Satisfied Than Not"
  )

ggplot(data = att, aes(x = WorkLifeBalance, fill = WorkLifeBalance)) +
  geom_bar(color = "dimgrey") +
    scale_fill_manual(values = c("azure", "azure2", "azure3", "azure4")) +
  facet_wrap(~JobRole) +
  theme_test() +
  labs(
    title = "Work Life Balance by JobRole",
    subtitle = "Similar Experiences Across Role Groups"
  )

Relationship Satistfaction

ggplot(data = att, aes(x = RelationshipSatisfaction, fill = RelationshipSatisfaction)) +
  geom_bar(color = "dimgrey", show.legend = FALSE) +
  scale_fill_manual(values = c("azure", "azure2", "azure3", "azure4")) +
  theme_test()+
  labs(
    title = "Relationship Satisfaction All Respondents",
    subtitle = "Generally More Satisfied Than Not"
  )

ggplot(data = att, aes(x = JobSatisfaction, fill = JobSatisfaction)) +
  geom_bar(color = "dimgrey") +
    scale_fill_manual(values = c("azure", "azure2", "azure3", "azure4")) +
  facet_wrap(~JobRole) +
  theme_test() +
  labs(
    title = "Relationship Satisfaction by JobRole",
    subtitle = "Diferent Experiences Across Role Groups"
  )

Job Involvement

ggplot(data = att, aes(x = JobInvolvement, fill = JobInvolvement)) +
  geom_bar(color = "dimgrey", show.legend = FALSE) +
  scale_fill_manual(values = c("azure", "azure2", "azure3", "azure4")) +
  theme_test() +
  labs(
    title = "Job Involvement All Respondents",
    subtitle = "Generally Perceive More Engaged Than Not"
  )

ggplot(data = att, aes(x = JobInvolvement, fill = JobInvolvement)) +
  geom_bar(color = "dimgrey") +
    scale_fill_manual(values = c("azure", "azure2", "azure3", "azure4")) +
  facet_wrap(~JobRole) +
  theme_test() +
  labs(
    title = "Job Involvement by JobRole",
    subtitle = "Nurses and Other Group Perceived Greater Engagement"
  )

Performance Rating

ggplot(data = att, aes(x = PerformanceRating, fill = PerformanceRating)) +
  geom_bar(color = "dimgrey", show.legend = FALSE) +
  scale_fill_manual(values = c("azure3", "azure4")) +
  theme_test() +
  labs(
    title = "Performance Rating All Respondents",
    subtitle = "No Poorly Performing Respondents"
  )

ggplot(data = att, aes(x = PerformanceRating, fill = PerformanceRating)) +
  geom_bar(color = "dimgrey") +
  scale_fill_manual(values = c("azure3", "azure4")) +
  facet_wrap(~JobRole) +
  theme_test()+
  labs(
    title = "Performance Rating by JobRole",
    subtitle = "Similar Ratings Across Role Groups"
  )

Attrition

All Respondents and By JobRole

ggplot(data = att, aes(x = Attrition, fill = Attrition)) +
  geom_bar(color = "dimgrey", show.legend = FALSE) +
  theme_test() + 
  scale_fill_manual(values = c("red4", "green4")) +
  labs(
    title = "Attrition All Respondents",
    subtitle = "Recent Resignation or Separation Relatively Rare"
  )

ggplot(data = att, aes(x = Attrition, fill = Attrition)) +
  geom_bar(color = "dimgrey") +
  scale_fill_manual(values = c("red4", "green4")) +
  facet_wrap(~JobRole) +
  theme_test() +
  labs(
    title = "Attrition by JobRole",
    subtitle = "Relatively Rare Across Role Groups"
  )

Attrition by Gender

ggplot(data = att, aes(x = Gender, fill = Attrition)) +
  geom_bar(color = "dimgrey") +
  theme_test() +
  scale_fill_manual(values = c("red4", "green4")) +
  labs(
    title = "Attrition by Gender", 
    subtitle = "Similar Experiences for Male and Female Gender"
  )

addmargins(table(att$Gender, att$Attrition))
##         
##            No  Yes  Sum
##   Female  563   86  649
##   Male    835  111  946
##   Sum    1398  197 1595
Percent_Female_Attrition <- 86/678
Percent_Male_Attrition <- 113/998
Percent_Female_Attrition
## [1] 0.1268437
Percent_Male_Attrition
## [1] 0.1132265

Age by Attrition Status

ggplot(data = att, aes(x = Attrition, y = Age, color = Attrition)) + 
  geom_violin(show.legend = FALSE) +
  geom_jitter(show.legend = FALSE, shape = 20, alpha = 0.33) +
  geom_boxplot(width = 0.15, show.legend = FALSE) +
  theme_test() +
  scale_color_manual(values = c("red4", "green4")) +
  labs(
    title = "Age by Attrition Status",
    subtitle = "Those who Recently Left Job Appear Younger on Avg"
  )

Distance From Home by Attrition Status

ggplot(data = att, aes(x = Attrition, y = DistanceFromHome, color = Attrition)) +
  geom_violin(color = "dimgrey", show.legend = FALSE) +
  geom_jitter(show.legend = FALSE, shape = 20, alpha = 0.33) +
  geom_boxplot(width = 0.15, show.legend = FALSE) +
  theme_test() +
  scale_color_manual(values = c("red4", "green4")) +
  labs(
    title = "Travel Distance by Attrition Status",
    subtitle = "Greater Variability in Travel Distances for Those Who Left Role"
  )

Hourly Rate by Attrition Status

ggplot(data = att, aes(x = Attrition, y = HourlyRate, color = Attrition)) +
  geom_violin(show.legend = FALSE) +
  geom_jitter(show.legend = FALSE, shape = 20, alpha = 0.33) +
  geom_boxplot(width = 0.15, show.legend = FALSE) +
  theme_test() +
  scale_color_manual(values = c("red4", "green4")) +
  labs(
    title = "Hourly Rate by Attrition Status",
    subtitle = "Similar Pay Experiences"
  )

Attrition by OverTime Participation

ggplot(data = att, aes(x = Attrition, fill = Attrition)) +
  geom_bar(color = "dimgrey", show.legend = FALSE) +
  facet_wrap(~ OverTime) +
  theme_test() +
  scale_fill_manual(values = c("red4", "green4")) +
  labs(
    title = "Attrition by OverTime Participation - All Participants",
    subtitle = "Much Higher Percentage of Over Time Participants Leave Job",
    x = "No Over Time on Left, At Least Some Over Time on Right"
  )

Percent Salary Hike by Attrition Status

ggplot(data = att, aes(x = Attrition, y = PercentSalaryHike, color = Attrition)) +
  geom_violin(show.legend = FALSE) +
  geom_jitter(show.legend = FALSE, shape = 20, alpha = 0.33) +
  geom_boxplot(width = 0.15, show.legend = FALSE) +
  theme_test() +
  scale_color_manual(values = c("red4", "green4")) +
  labs(
    title = "Percent Salary Change by Attrition Status",
    subtitle = "No Difference"
  )

Total Working Years by Attrition Status

ggplot(data = att, aes(x = Attrition, y = TotalWorkingYears, color = Attrition)) +
  geom_violin(show.legend = FALSE) +
  geom_jitter(show.legend = FALSE, shape = 20, alpha = 0.33) +
  geom_boxplot(width = 0.15, show.legend = FALSE) +
  theme_test() +
  scale_color_manual(values = c("red4", "green4")) +
  labs(
    title = "Career Duration by Attrition Status",
    subtitle = "Shorter Career Duration on Avg for Those Who Left Role"
  )

Years at Current Company

ggplot(data = att, aes(x = Attrition, y = YearsAtCompany, color = Attrition)) +
  geom_violin(show.legend = FALSE) +
  geom_jitter(show.legend = FALSE, shape = 20, alpha = 0.33) +
  geom_boxplot(width = 0.15, show.legend = FALSE) +
  theme_test() +
  scale_color_manual(values = c("red4", "green4")) +
  labs(
    title = "Years with Current Company by Attrition Status",
    subtitle = "Less time with Current Company for Those who Left Role"
  )

Years in Current Role by Attrition Status

ggplot(data = att, aes(x = Attrition, y = YearsInCurrentRole, color = Attrition)) +
  geom_violin(show.legend = FALSE) +
  geom_jitter(show.legend = FALSE, shape = 20, alpha = 0.33) +
  geom_boxplot(width = 0.15, show.legend = FALSE) +
  theme_test() +
  scale_color_manual(values = c("red4", "green4")) +
  labs(
    title = "Years in Current Role by Attrition Status",
    subtitle = "Fewer Years in Current Role for Those Who Left Job"
  )

Years Since Last Promotion

ggplot(data = att, aes(x = Attrition, y = YearsSinceLastPromotion, color = Attrition)) +
  geom_violin(show.legend = FALSE) +
  geom_jitter(show.legend = FALSE, shape = 20, alpha = 0.33) +
  geom_boxplot(width = 0.15, show.legend = FALSE) +
  theme_test() +
  scale_color_manual(values = c("red4", "green4")) +
  labs(
    title = "Years Since Promotion by Attrition Status",
    subtitle = "Greater Variability for Those Who Have Not Left Job"
  )

Attrition by Environment Satisfaction

ggplot(data = att, aes(x = EnvironmentSatisfaction, fill = Attrition)) +
  geom_bar(color = "dimgrey") +
  theme_test() +
  scale_fill_manual(values = c("red4", "green4")) +
  labs(
    title = "Attrition by Environment Satisfaction Scores",
    subtitle = "Greater Percentage of Less Satsified Leave Role"
  )

Attrition by Marital Status

ggplot(data = att, aes(x = MaritalStatus, fill = Attrition)) +
  geom_bar(color = "dimgrey") +
  theme_test() +
  scale_fill_manual(values = c("red4", "green4")) +
  labs(
    title = "Attrition by Marital Status",
    subtitle = "Higher Percentage of Married Respondents Left Role"
  )

Attrition by Relationship Satisfaction

ggplot(data = att, aes(x = RelationshipSatisfaction, fill = Attrition)) +
  geom_bar(color = "dimgrey") +
  theme_test() +
  scale_fill_manual(values = c("red4", "green4")) +
  labs(
    title = "Attrition by Relationship Satisfaction Scores",
    subtitle = "Similar Attrition Rates across Grous"
  )

Attrition by Work Life Balance

ggplot(data = att, aes(x = WorkLifeBalance, fill = Attrition)) +
  geom_bar(color = "dimgrey") +
  scale_fill_manual(values = c("red4", "green4")) +
  theme_test() +
  labs(
    title = "Attrition by Work Life Balance Scores",
    subtitle = "Similar Attrition across Groups"
  )

Attrition by Job Involvement

ggplot(data = att, aes(x = JobInvolvement, fill = Attrition)) +
  geom_bar(color = "dimgrey") +
  theme_test() +
  scale_fill_manual(values = c("red4", "green4")) +
  labs(
    title = "Attrition by Job Involvement Scores", 
    subtitle = "Higher Percentage of Lower Scores Leave Role"
  )

Attrition by Job Satisfaction

ggplot(data = att, aes(x = JobSatisfaction, fill = Attrition)) +
  geom_bar(color = "dimgrey") +
  theme_test() +
  scale_fill_manual(values = c("red4", "green4")) +
  labs(
    title = "Attrition by Job Satisfaction",
    subtitle = "Similar Gross Attrition but Greater Percentage for Lower Scores"
  )

Attrition by Performance Rating

ggplot(data = att, aes(x = PerformanceRating, fill = Attrition)) +
  geom_bar(color = "dimgrey") +
  theme_test() +
  scale_fill_manual(values = c("red4", "green4"))+
  labs(
    title = "Attrition by Performance Rating",
    subtitle = "Similar Attrition Rates across High Performing Respondents"
  )

addmargins(table(att$Attrition, att$PerformanceRating))
##      
##          1    2    3    4  Sum
##   No     0    0 1186  212 1398
##   Yes    0    0  165   32  197
##   Sum    0    0 1351  244 1595
percent_attrition_score3 <- 167/1424
percent_attrition_score4 <- 32/252
percent_attrition_score3
## [1] 0.1172753
percent_attrition_score4
## [1] 0.1269841

Attrition by Job Level

ggplot(data = att, aes(x = JobLevel, fill = Attrition)) +
  geom_bar(color = "dimgrey") +
  theme_test() +
  scale_fill_manual(values = c("red4", "green4")) +
  labs(
    title = "Attrition by Job Level",
    subtitle = "Hihger Percentage of Respondents with Lowest Job Level Leave Role"
  )

Initial Log Regression Models

Dividing the Dataset

I divided the dataset with observations with NA for JobLevel removed into a training dataset and testing dataset with 80% of the observations going to former and 20% to the latter.

train_set_size <- floor(0.80 * nrow(att))
set.seed(316)
train_div<-sample(seq_len(nrow(att)), size = train_set_size)
train.att<-att[train_div, ]
test.att <- att[-train_div, ]
dim(train.att)
## [1] 1276   35
dim(test.att)
## [1] 319  35

Model Specificiation Rationales

To identify relevant independent variables to include in the logistic regression models, I focused on the Attrition visualizations and anecdotal evidence shared by healthcare workers regarding their rationales for leaving their jobs.

Specific independent variables included initially include JobRole, Age, DistanceFromHome, TotalWorkingYears, YearsAtCompany YearsInCurrentRole, YearsSinceLastPromotion, EnvironmentSatisfaction, OverTime, MartialStatus, JobInvolvement, JobSatisfaction, and JobLevel.

When considering the temporal variables, there is likely some degree of correlation between Age and TotalWorkingYears, YearsAtCompany, and YearsInCurrentRole. Given this, I will create an additional model with the non-Age temporal variables removed to see how they perform against each other.

It is possible there are associations between Age and MaritalStatus and JobLevel as well. I will pull these variables in a third model that will only include JobRole, Age, DistanceFromHome, YearsSinceLastPromotion, EnvironmentalSatisfaction, OverTime, JobInvolvement, and JobSatisfaction.

Lastly, I have been considering JobRole as a grouping variable and using fixed effects in the first three models. For the fourth model, I will explore a mixed effects model using “random” intercepts for the different role groups. Further grouping or levels is limited by the nature of this dataset.

Models & Comparing Fit on Training Data

if (!require(lme4)) install.packages("lme4")
## Loading required package: lme4
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
library(lme4)
if (!require(jtools)) install.packages("jtools")
## Loading required package: jtools
library(jtools)
if (!require(ResourceSelection)) install.packages("ResourceSelection")
## Loading required package: ResourceSelection
## ResourceSelection 0.3-6   2023-06-27
library(ResourceSelection)
if (!require(lmtest)) install.packages("lmtest")
## Loading required package: lmtest
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(lmtest)
reg1 <- glm(Attrition ~ JobRole+Age+DistanceFromHome+TotalWorkingYears+ YearsAtCompany + YearsInCurrentRole + YearsSinceLastPromotion + EnvironmentSatisfaction + OverTime + MaritalStatus + JobInvolvement + JobSatisfaction + JobLevel, data = train.att, family = "binomial")
summ(reg1, exp = TRUE, confint = TRUE)
Observations 1276
Dependent variable Attrition
Type Generalized linear model
Family binomial
Link logit
χ²(24) 508.11
Pseudo-R² (Cragg-Uhler) 0.63
Pseudo-R² (McFadden) 0.54
AIC 485.75
BIC 614.54
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 0.00 0.00 Inf -0.03 0.98
JobRoleNurse 11946697.92 0.00 Inf 0.03 0.98
JobRoleOther 8477435.94 0.00 Inf 0.03 0.98
JobRoleTherapist 4926000.76 0.00 Inf 0.03 0.98
Age 0.91 0.87 0.95 -4.52 0.00
DistanceFromHome 1.09 1.06 1.12 5.44 0.00
TotalWorkingYears 0.97 0.90 1.04 -0.84 0.40
YearsAtCompany 0.86 0.75 0.99 -2.08 0.04
YearsInCurrentRole 0.83 0.70 0.99 -2.07 0.04
YearsSinceLastPromotion 1.27 1.10 1.46 3.25 0.00
EnvironmentSatisfaction.L 0.20 0.12 0.33 -5.98 0.00
EnvironmentSatisfaction.Q 2.06 1.22 3.47 2.72 0.01
EnvironmentSatisfaction.C 1.00 0.60 1.67 0.01 0.99
OverTimeYes 46.27 23.78 90.05 11.29 0.00
MaritalStatusMarried 1.71 0.83 3.54 1.44 0.15
MaritalStatusSingle 10.05 4.74 21.33 6.01 0.00
JobInvolvement.L 0.12 0.05 0.29 -4.61 0.00
JobInvolvement.Q 1.17 0.57 2.38 0.43 0.67
JobInvolvement.C 0.93 0.58 1.50 -0.28 0.78
JobSatisfaction.L 0.36 0.22 0.61 -3.80 0.00
JobSatisfaction.Q 0.92 0.54 1.56 -0.30 0.76
JobSatisfaction.C 0.75 0.44 1.27 -1.08 0.28
JobLevel.L 2.25 0.45 11.12 0.99 0.32
JobLevel.Q 4.49 1.46 13.76 2.63 0.01
JobLevel.C 0.39 0.18 0.84 -2.39 0.02
Standard errors: MLE
hoslem.test(reg1$y,predict(reg1, type = "response"))
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  reg1$y, predict(reg1, type = "response")
## X-squared = 6.2056, df = 8, p-value = 0.6242
reg2 <- glm(Attrition ~ JobRole+Age+DistanceFromHome + YearsSinceLastPromotion + EnvironmentSatisfaction + OverTime + MaritalStatus + JobInvolvement + JobSatisfaction + JobLevel, data = train.att, family = "binomial")
summ(reg2, exp = TRUE, confint = TRUE)
Observations 1276
Dependent variable Attrition
Type Generalized linear model
Family binomial
Link logit
χ²(21) 469.94
Pseudo-R² (Cragg-Uhler) 0.59
Pseudo-R² (McFadden) 0.50
AIC 517.92
BIC 631.25
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 0.00 0.00 Inf -0.03 0.98
JobRoleNurse 7546941.07 0.00 Inf 0.02 0.98
JobRoleOther 4952349.21 0.00 Inf 0.02 0.98
JobRoleTherapist 3309751.64 0.00 Inf 0.02 0.98
Age 0.89 0.86 0.92 -6.09 0.00
DistanceFromHome 1.08 1.05 1.12 5.33 0.00
YearsSinceLastPromotion 0.99 0.89 1.10 -0.17 0.86
EnvironmentSatisfaction.L 0.23 0.14 0.37 -5.79 0.00
EnvironmentSatisfaction.Q 1.85 1.13 3.02 2.45 0.01
EnvironmentSatisfaction.C 1.00 0.61 1.64 0.01 0.99
OverTimeYes 37.63 20.23 69.98 11.46 0.00
MaritalStatusMarried 1.75 0.87 3.54 1.56 0.12
MaritalStatusSingle 8.69 4.22 17.92 5.86 0.00
JobInvolvement.L 0.11 0.04 0.25 -5.00 0.00
JobInvolvement.Q 1.11 0.55 2.22 0.29 0.77
JobInvolvement.C 0.91 0.58 1.44 -0.40 0.69
JobSatisfaction.L 0.38 0.23 0.63 -3.80 0.00
JobSatisfaction.Q 0.77 0.47 1.27 -1.02 0.31
JobSatisfaction.C 0.85 0.52 1.40 -0.62 0.53
JobLevel.L 0.83 0.23 2.99 -0.29 0.77
JobLevel.Q 3.77 1.41 10.09 2.64 0.01
JobLevel.C 0.34 0.17 0.67 -3.07 0.00
Standard errors: MLE
lrtest(reg1, reg2)
## Likelihood ratio test
## 
## Model 1: Attrition ~ JobRole + Age + DistanceFromHome + TotalWorkingYears + 
##     YearsAtCompany + YearsInCurrentRole + YearsSinceLastPromotion + 
##     EnvironmentSatisfaction + OverTime + MaritalStatus + JobInvolvement + 
##     JobSatisfaction + JobLevel
## Model 2: Attrition ~ JobRole + Age + DistanceFromHome + YearsSinceLastPromotion + 
##     EnvironmentSatisfaction + OverTime + MaritalStatus + JobInvolvement + 
##     JobSatisfaction + JobLevel
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1  25 -217.88                         
## 2  22 -236.96 -3 38.165  2.607e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
hoslem.test(reg2$y, predict(reg2, type = "response"))
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  reg2$y, predict(reg2, type = "response")
## X-squared = 3.0868, df = 8, p-value = 0.9288
reg3 <- glm(Attrition ~ JobRole+Age+DistanceFromHome + YearsSinceLastPromotion + EnvironmentSatisfaction + OverTime + JobInvolvement + JobSatisfaction, data = train.att, family = "binomial")
summ(reg3, exp = TRUE, confint = TRUE)
Observations 1276
Dependent variable Attrition
Type Generalized linear model
Family binomial
Link logit
χ²(16) 366.50
Pseudo-R² (Cragg-Uhler) 0.48
Pseudo-R² (McFadden) 0.39
AIC 611.36
BIC 698.94
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 0.00 0.00 Inf -0.02 0.98
JobRoleNurse 4758164.12 0.00 Inf 0.02 0.98
JobRoleOther 6108699.25 0.00 Inf 0.02 0.98
JobRoleTherapist 1128628.72 0.00 Inf 0.02 0.98
Age 0.87 0.84 0.90 -8.03 0.00
DistanceFromHome 1.06 1.03 1.08 4.23 0.00
YearsSinceLastPromotion 0.98 0.89 1.06 -0.55 0.58
EnvironmentSatisfaction.L 0.32 0.21 0.49 -5.25 0.00
EnvironmentSatisfaction.Q 2.08 1.33 3.25 3.20 0.00
EnvironmentSatisfaction.C 0.88 0.56 1.38 -0.57 0.57
OverTimeYes 18.57 11.30 30.53 11.52 0.00
JobInvolvement.L 0.14 0.06 0.30 -4.95 0.00
JobInvolvement.Q 0.99 0.53 1.85 -0.02 0.98
JobInvolvement.C 0.85 0.57 1.28 -0.76 0.44
JobSatisfaction.L 0.39 0.25 0.61 -4.20 0.00
JobSatisfaction.Q 0.78 0.50 1.22 -1.08 0.28
JobSatisfaction.C 0.85 0.54 1.33 -0.71 0.48
Standard errors: MLE
hoslem.test(reg3$y, predict(reg3, type = "response"))
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  reg3$y, predict(reg3, type = "response")
## X-squared = 5.6272, df = 8, p-value = 0.6889
reg4 <- glmer(Attrition ~ Age + DistanceFromHome + TotalWorkingYears + YearsAtCompany + YearsInCurrentRole + YearsSinceLastPromotion + EnvironmentSatisfaction + OverTime + MaritalStatus + JobInvolvement + JobSatisfaction + JobLevel + (1|JobRole), data = train.att, family = "binomial")
summ(reg4, exp = TRUE, confint = TRUE)
Observations 1276
Dependent variable Attrition
Type Mixed effects generalized linear model
Family binomial
Link logit
AIC 493.29
BIC 611.77
Pseudo-R² (fixed effects) 0.75
Pseudo-R² (total) 0.75
Fixed Effects
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 0.27 0.05 1.35 -1.60 0.11
Age 0.90 0.86 0.94 -4.70 0.00
DistanceFromHome 1.09 1.05 1.12 5.39 0.00
TotalWorkingYears 0.97 0.90 1.05 -0.76 0.45
YearsAtCompany 0.88 0.78 1.00 -2.04 0.04
YearsInCurrentRole 0.82 0.70 0.97 -2.38 0.02
YearsSinceLastPromotion 1.27 1.10 1.47 3.33 0.00
EnvironmentSatisfaction.L 0.20 0.12 0.34 -5.95 0.00
EnvironmentSatisfaction.Q 2.04 1.21 3.42 2.69 0.01
EnvironmentSatisfaction.C 0.99 0.60 1.65 -0.03 0.97
OverTimeYes 44.56 23.03 86.19 11.28 0.00
MaritalStatusMarried 1.54 0.74 3.17 1.16 0.25
MaritalStatusSingle 9.26 4.40 19.48 5.87 0.00
JobInvolvement.L 0.12 0.05 0.28 -4.73 0.00
JobInvolvement.Q 1.17 0.58 2.37 0.45 0.65
JobInvolvement.C 0.98 0.61 1.56 -0.10 0.92
JobSatisfaction.L 0.36 0.22 0.60 -3.90 0.00
JobSatisfaction.Q 0.97 0.58 1.63 -0.11 0.91
JobSatisfaction.C 0.70 0.42 1.17 -1.36 0.18
JobLevel.L 0.83 0.21 3.27 -0.27 0.79
JobLevel.Q 2.61 0.98 6.95 1.92 0.05
JobLevel.C 0.35 0.17 0.72 -2.85 0.00
Random Effects
Group Parameter Std. Dev.
JobRole (Intercept) 0.13
Grouping Variables
Group # groups ICC
JobRole 4 0.00
lrtest(reg1, reg2)
## Likelihood ratio test
## 
## Model 1: Attrition ~ JobRole + Age + DistanceFromHome + TotalWorkingYears + 
##     YearsAtCompany + YearsInCurrentRole + YearsSinceLastPromotion + 
##     EnvironmentSatisfaction + OverTime + MaritalStatus + JobInvolvement + 
##     JobSatisfaction + JobLevel
## Model 2: Attrition ~ JobRole + Age + DistanceFromHome + YearsSinceLastPromotion + 
##     EnvironmentSatisfaction + OverTime + MaritalStatus + JobInvolvement + 
##     JobSatisfaction + JobLevel
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1  25 -217.88                         
## 2  22 -236.96 -3 38.165  2.607e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lrtest(reg1, reg3)
## Likelihood ratio test
## 
## Model 1: Attrition ~ JobRole + Age + DistanceFromHome + TotalWorkingYears + 
##     YearsAtCompany + YearsInCurrentRole + YearsSinceLastPromotion + 
##     EnvironmentSatisfaction + OverTime + MaritalStatus + JobInvolvement + 
##     JobSatisfaction + JobLevel
## Model 2: Attrition ~ JobRole + Age + DistanceFromHome + YearsSinceLastPromotion + 
##     EnvironmentSatisfaction + OverTime + JobInvolvement + JobSatisfaction
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1  25 -217.88                         
## 2  17 -288.68 -8 141.61  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lrtest(reg2, reg3)
## Likelihood ratio test
## 
## Model 1: Attrition ~ JobRole + Age + DistanceFromHome + YearsSinceLastPromotion + 
##     EnvironmentSatisfaction + OverTime + MaritalStatus + JobInvolvement + 
##     JobSatisfaction + JobLevel
## Model 2: Attrition ~ JobRole + Age + DistanceFromHome + YearsSinceLastPromotion + 
##     EnvironmentSatisfaction + OverTime + JobInvolvement + JobSatisfaction
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1  22 -236.96                         
## 2  17 -288.68 -5 103.45  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Diagnostic Testing Reg1

I selected the reg1 model over the other three as there was evidence that one of the models outperformed the other fixed effects models. While Reg4 had the highest pseudo-R^2, reg1 had a lower AIC. Also, reg4’s ICC was so poor that I am unsure of the value of including the random intercept for JobRole.

Residual Independence

Residual Plots

if (!require(car)) install.packages("car")
## Loading required package: car
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
library(car)
residualPlots(reg1)

##                         Test stat Pr(>|Test stat|)  
## JobRole                                             
## Age                        5.4193          0.01992 *
## DistanceFromHome           0.2694          0.60376  
## TotalWorkingYears          1.2127          0.27079  
## YearsAtCompany             2.4149          0.12018  
## YearsInCurrentRole         0.8365          0.36041  
## YearsSinceLastPromotion    2.2510          0.13353  
## EnvironmentSatisfaction                             
## OverTime                                            
## MaritalStatus                                       
## JobInvolvement                                      
## JobSatisfaction                                     
## JobLevel                                            
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the above plots, there are little to no relationships between the residuals and the continuous independent variables in reg1. Interestingly, the p-value for age is quite small. If I consider this one piece of evidence alone, I need to consider rejecting my null hypothesis that age fits the model well. I think the best course of action may be to compare two additional models, one with and one without age, alongside any additional changes that may arise with the continued assumption testing.

Linearity of Relationships Between Continuous Ind Var and Natural Logs of Odds

marginalModelPlots(reg1)
## Warning in mmps(...): Interactions and/or factors skipped

The estimated relationship between independent variables and Attrition and the model predictions for the relationship is strong as evidenced by the consistent overlap of the “Data” and “Model” lines. However, these relationships deviate from being linear.

linearity.check <- train.att %>%
  select(Age, DistanceFromHome, TotalWorkingYears, YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion)
predictors <- colnames(linearity.check)
linearity.check$probabilities <- reg1$fitted.values
linearity.check <- linearity.check %>%
  mutate(logit = log(probabilities/(1 - probabilities))) %>%
  select(-probabilities) %>%
  gather(key = "predictors", value = "predictor.values", -logit)
ggplot(linearity.check, aes(y = logit, x = predictor.values)) +
  geom_point() +
  stat_smooth(method = "loess") +
  theme_test() +
  facet_wrap(~predictors, scales = "free_x")
## `geom_smooth()` using formula = 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at -0.075
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 2.075
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 2.4351e-15
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : There are other near singularities as well. 4
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
## -0.075
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 2.075
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 2.4351e-15
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other near
## singularities as well. 4

I decided to assess the actual logits against the continuous independent variables. These grossly appear linear. What is more notable to me is the clustering of lower value logits. The way I visualized these plots disallows me from identifying if these reflect a specific level of a categorical independent variable.

For now, I will not plan on adding any polynomial terms or any other transformations.

Nonadditivity

Conceptually, there could be some degree of interaction between or among these independent variables that I have not addressed. I am most concerned about an interaction between JobLevel and the temporal variables.

Multicollineartiy

Correlation Matrix for Continuous Independent Variables

reg1.cont.ind <- att[c("Age", "DistanceFromHome", "TotalWorkingYears", "YearsAtCompany", "YearsInCurrentRole", "YearsSinceLastPromotion")]
cor(reg1.cont.ind)
##                                Age DistanceFromHome TotalWorkingYears
## Age                     1.00000000       0.01533062        0.66155501
## DistanceFromHome        0.01533062       1.00000000        0.02189604
## TotalWorkingYears       0.66155501       0.02189604        1.00000000
## YearsAtCompany          0.29307296       0.02097758        0.61806113
## YearsInCurrentRole      0.21150996       0.02618534        0.47253042
## YearsSinceLastPromotion 0.19748605       0.01630781        0.39211092
##                         YearsAtCompany YearsInCurrentRole
## Age                         0.29307296         0.21150996
## DistanceFromHome            0.02097758         0.02618534
## TotalWorkingYears           0.61806113         0.47253042
## YearsAtCompany              1.00000000         0.77189266
## YearsInCurrentRole          0.77189266         1.00000000
## YearsSinceLastPromotion     0.61159314         0.54189737
##                         YearsSinceLastPromotion
## Age                                  0.19748605
## DistanceFromHome                     0.01630781
## TotalWorkingYears                    0.39211092
## YearsAtCompany                       0.61159314
## YearsInCurrentRole                   0.54189737
## YearsSinceLastPromotion              1.00000000
vif(reg1)
##                             GVIF Df GVIF^(1/(2*Df))
## JobRole                 1.486018  3        1.068245
## Age                     1.991913  1        1.411351
## DistanceFromHome        1.221443  1        1.105189
## TotalWorkingYears       3.268777  1        1.807976
## YearsAtCompany          4.621883  1        2.149857
## YearsInCurrentRole      3.343026  1        1.828394
## YearsSinceLastPromotion 2.280921  1        1.510272
## EnvironmentSatisfaction 1.506820  3        1.070722
## OverTime                1.860166  1        1.363879
## MaritalStatus           1.350665  2        1.078045
## JobInvolvement          1.224774  3        1.034370
## JobSatisfaction         1.263163  3        1.039704
## JobLevel                3.942323  3        1.256875

As expected in the initial model specifications, there are moderate to strong relationships among the temporal independent variables. Notably, Age and TotalWorkingYears, TotalWorkingYears and YearsAtCompany, and YearsInCurrentRole and YearsAtCompany.

When considering the Variance Inflation Factor, YearsAtCompany is relatively large. I think it may be most reasonable to generate a new model by retaining all of the independent variables save for YearsAtCompany. I will also consider whether or note to include Age based on comparing two models, one with and one without Age.

Zero Cell Count

No participants rated their JobPerformance below a 3. Fortunately, I did not include this variable in any of my models. Therefore, I have no concerns regarding this modelling assumption.

Complete Separation of Prediction

No independent variables perfectly or near perfectly predict the probabilities for Attrition.

Outliers and Influential Observations

outlierTest(reg1, labels = train.att$EmployeeID)
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
##         rstudent unadjusted p-value Bonferroni p
## 1536328  3.30954         0.00093449           NA
influencePlot(reg1)

##            StudRes         Hat        CookD
## 1601 -0.0009942438 0.368433609 1.413778e-08
## 317  -0.9500951052 0.239881472 7.645568e-03
## 1221  3.1026592081 0.024904990 6.059328e-02
## 964   3.3095401360 0.005506775 3.446135e-02
## 973   2.2994156966 0.134289854 5.092781e-02

Observations 317, 964, 973, 1221, and 1601 demonstrated significant influence. I will investigate these observations by first by pulling them into their own table. I will then highlight them on scatterplots of the continuous variables.

if(!require(gridExtra)) install.packages("gridExtra")
## Loading required package: gridExtra
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(gridExtra)
Outliers.reg1 <- train.att %>%
  filter(EmployeeID %in% c(1230043, 1536328, 1081600, 1675357, 1553672)) %>%
  select(EmployeeID, JobRole, Age, DistanceFromHome, TotalWorkingYears, YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion, EnvironmentSatisfaction, OverTime, MaritalStatus, JobInvolvement, JobSatisfaction, JobLevel)
print(Outliers.reg1)
##   EmployeeID        JobRole Age DistanceFromHome TotalWorkingYears
## 1    1553672 Administrative  47               29                10
## 2    1230043          Nurse  49                1                25
## 3    1675357          Nurse  47                9                25
## 4    1536328      Therapist  58                7                31
## 5    1081600          Nurse  55               13                24
##   YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion
## 1             10                  7                       9
## 2              7                  1                       0
## 3             23                  5                      14
## 4             10                  9                       5
## 5             19                  7                       3
##   EnvironmentSatisfaction OverTime MaritalStatus JobInvolvement JobSatisfaction
## 1                       1      Yes       Married              3               2
## 2                       3      Yes        Single              2               3
## 3                       3       No       Married              1               3
## 4                       3      Yes       Married              2               1
## 5                       1      Yes        Single              4               3
##   JobLevel
## 1        3
## 2        4
## 3        4
## 4        3
## 5        4
ggplot(data = train.att, aes(x = Age)) +
  geom_histogram(binwidth = 2, color = "dimgrey", fill = "grey65") +
  annotate(geom = "text", x = 47, y = 10, label = "155367", color = "dodgerblue4", size = 2.5) + 
  annotate(geom = "text", x = 49, y = 5, label = "1230043", color = "hotpink4", size = 2.5) +
   annotate(geom = "text", x = 47, y = 20, label = "1675357", color = "hotpink4", size = 2.5) +
   annotate(geom = "text", x = 58, y = 23, label = "1536328", color = "seagreen4", size = 2.5) +
   annotate(geom = "text", x = 55, y = 20, label = "1081600", color = "hotpink4", size = 2.5) + 
  theme_test() +
  labs(
    x = "Age",
    y = "Count",
    title = "Potential Outliers' Ages",
    subtitle = "All toward Upper Limit but Reasonable Values"
  )

New Models

Based on the Assumption Diagnostic Testing, I decided to definitively remove YearsAtCompany. Lastly, I will create two models, one with Age and one without Age.

New Models

reg5 <- glm(Attrition ~ JobRole+Age+DistanceFromHome+TotalWorkingYears + YearsInCurrentRole + YearsSinceLastPromotion + EnvironmentSatisfaction + OverTime + MaritalStatus + JobInvolvement + JobSatisfaction + JobLevel, data = train.att, family = "binomial")
summ(reg5, exp = TRUE, confint = TRUE)
Observations 1276
Dependent variable Attrition
Type Generalized linear model
Family binomial
Link logit
χ²(23) 503.25
Pseudo-R² (Cragg-Uhler) 0.62
Pseudo-R² (McFadden) 0.53
AIC 488.61
BIC 612.24
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 0.00 0.00 Inf -0.03 0.98
JobRoleNurse 10302592.11 0.00 Inf 0.03 0.98
JobRoleOther 6940935.92 0.00 Inf 0.03 0.98
JobRoleTherapist 4182039.74 0.00 Inf 0.03 0.98
Age 0.91 0.87 0.95 -4.44 0.00
DistanceFromHome 1.09 1.06 1.12 5.37 0.00
TotalWorkingYears 0.95 0.88 1.02 -1.38 0.17
YearsInCurrentRole 0.74 0.65 0.84 -4.73 0.00
YearsSinceLastPromotion 1.21 1.06 1.38 2.84 0.00
EnvironmentSatisfaction.L 0.20 0.12 0.34 -5.95 0.00
EnvironmentSatisfaction.Q 1.96 1.17 3.27 2.57 0.01
EnvironmentSatisfaction.C 1.06 0.64 1.76 0.24 0.81
OverTimeYes 44.70 23.16 86.27 11.33 0.00
MaritalStatusMarried 1.74 0.84 3.59 1.49 0.14
MaritalStatusSingle 9.87 4.67 20.85 6.00 0.00
JobInvolvement.L 0.12 0.05 0.29 -4.67 0.00
JobInvolvement.Q 1.19 0.58 2.43 0.48 0.63
JobInvolvement.C 0.92 0.58 1.48 -0.33 0.74
JobSatisfaction.L 0.36 0.21 0.60 -3.90 0.00
JobSatisfaction.Q 0.96 0.57 1.63 -0.14 0.89
JobSatisfaction.C 0.72 0.43 1.21 -1.24 0.22
JobLevel.L 1.28 0.28 5.93 0.32 0.75
JobLevel.Q 3.03 1.04 8.78 2.04 0.04
JobLevel.C 0.32 0.15 0.67 -3.01 0.00
Standard errors: MLE
reg6 <- glm(Attrition ~ JobRole+ DistanceFromHome+TotalWorkingYears + YearsInCurrentRole + YearsSinceLastPromotion + EnvironmentSatisfaction + OverTime +MaritalStatus + JobInvolvement + JobSatisfaction + JobLevel, data = train.att, family = "binomial")
summ(reg6, exp = TRUE, confint = TRUE)
Observations 1276
Dependent variable Attrition
Type Generalized linear model
Family binomial
Link logit
χ²(22) 479.92
Pseudo-R² (Cragg-Uhler) 0.60
Pseudo-R² (McFadden) 0.51
AIC 509.94
BIC 628.42
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 0.00 0.00 Inf -0.03 0.97
JobRoleNurse 15450797.13 0.00 Inf 0.03 0.98
JobRoleOther 10140302.85 0.00 Inf 0.03 0.98
JobRoleTherapist 6060016.86 0.00 Inf 0.03 0.98
DistanceFromHome 1.09 1.05 1.12 5.32 0.00
TotalWorkingYears 0.87 0.81 0.93 -4.22 0.00
YearsInCurrentRole 0.76 0.68 0.87 -4.25 0.00
YearsSinceLastPromotion 1.20 1.05 1.36 2.78 0.01
EnvironmentSatisfaction.L 0.22 0.13 0.37 -5.80 0.00
EnvironmentSatisfaction.Q 1.91 1.16 3.14 2.55 0.01
EnvironmentSatisfaction.C 1.00 0.61 1.63 -0.01 0.99
OverTimeYes 33.64 18.46 61.31 11.48 0.00
MaritalStatusMarried 1.65 0.81 3.37 1.38 0.17
MaritalStatusSingle 9.58 4.65 19.73 6.13 0.00
JobInvolvement.L 0.12 0.05 0.28 -4.79 0.00
JobInvolvement.Q 1.20 0.60 2.42 0.52 0.60
JobInvolvement.C 0.90 0.57 1.42 -0.46 0.64
JobSatisfaction.L 0.39 0.24 0.64 -3.71 0.00
JobSatisfaction.Q 1.10 0.66 1.83 0.37 0.71
JobSatisfaction.C 0.71 0.42 1.18 -1.33 0.18
JobLevel.L 1.56 0.35 6.96 0.58 0.56
JobLevel.Q 3.32 1.16 9.53 2.23 0.03
JobLevel.C 0.36 0.17 0.75 -2.73 0.01
Standard errors: MLE

There was little change in compartive model performance with Age removed. Given this, I will move forward with diagnostic testing of Reg6.

Diagnostic Testing Reg 6

residualPlots(reg6)

##                         Test stat Pr(>|Test stat|)  
## JobRole                                             
## DistanceFromHome           0.0547          0.81506  
## TotalWorkingYears          3.8661          0.04927 *
## YearsInCurrentRole         1.7947          0.18036  
## YearsSinceLastPromotion    2.8233          0.09290 .
## EnvironmentSatisfaction                             
## OverTime                                            
## MaritalStatus                                       
## JobInvolvement                                      
## JobSatisfaction                                     
## JobLevel                                            
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
marginalModelPlots(reg6)
## Warning in mmps(...): Interactions and/or factors skipped

linearity.check.new <-train.att %>%
  select(Age, DistanceFromHome, TotalWorkingYears, YearsInCurrentRole, YearsSinceLastPromotion)
predictors.new <- colnames(linearity.check.new)
linearity.check.new$probabilities <- reg6$fitted.values
linearity.check.new <- linearity.check.new %>%
  mutate(logit = log(probabilities/(1 - probabilities))) %>%
  select(-probabilities) %>%
  gather(key = "predictors", value = "predictor.values", -logit)
ggplot(linearity.check.new, aes(y = logit, x = predictor.values)) +
  geom_point() +
  stat_smooth(method = "loess") +
  theme_test() +
  facet_wrap(~predictors, scales = "free_x")
## `geom_smooth()` using formula = 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at -0.075
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 2.075
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 2.4351e-15
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : There are other near singularities as well. 4
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
## -0.075
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 2.075
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 2.4351e-15
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other near
## singularities as well. 4

reg6.cont.ind <- att[c("DistanceFromHome", "TotalWorkingYears", "YearsInCurrentRole", "YearsSinceLastPromotion")]
cor(reg6.cont.ind)
##                         DistanceFromHome TotalWorkingYears YearsInCurrentRole
## DistanceFromHome              1.00000000        0.02189604         0.02618534
## TotalWorkingYears             0.02189604        1.00000000         0.47253042
## YearsInCurrentRole            0.02618534        0.47253042         1.00000000
## YearsSinceLastPromotion       0.01630781        0.39211092         0.54189737
##                         YearsSinceLastPromotion
## DistanceFromHome                     0.01630781
## TotalWorkingYears                    0.39211092
## YearsInCurrentRole                   0.54189737
## YearsSinceLastPromotion              1.00000000
vif(reg6)
##                             GVIF Df GVIF^(1/(2*Df))
## JobRole                 1.502246  3        1.070180
## DistanceFromHome        1.198696  1        1.094850
## TotalWorkingYears       2.467401  1        1.570796
## YearsInCurrentRole      1.743091  1        1.320262
## YearsSinceLastPromotion 1.880525  1        1.371322
## EnvironmentSatisfaction 1.382737  3        1.055496
## OverTime                1.602801  1        1.266018
## MaritalStatus           1.296886  2        1.067150
## JobInvolvement          1.215446  3        1.033053
## JobSatisfaction         1.217971  3        1.033410
## JobLevel                3.427611  3        1.227906

Similar to the issue of Age in the initial model iterations, I now have evidence that TotalWorkingYears does not fit the logistic regression model well. I will generate an additional model with this variable removed and compare its performance against reg6.

Reg7

reg7 <- glm(Attrition ~ JobRole+ DistanceFromHome + YearsInCurrentRole + YearsSinceLastPromotion + EnvironmentSatisfaction + OverTime +MaritalStatus + JobInvolvement + JobSatisfaction + JobLevel, data = train.att, family = "binomial")
summ(reg7, exp = TRUE, confint = TRUE)
Observations 1276
Dependent variable Attrition
Type Generalized linear model
Family binomial
Link logit
χ²(21) 458.45
Pseudo-R² (Cragg-Uhler) 0.58
Pseudo-R² (McFadden) 0.49
AIC 529.41
BIC 642.74
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 0.00 0.00 Inf -0.04 0.97
JobRoleNurse 15200187.77 0.00 Inf 0.03 0.98
JobRoleOther 10069415.82 0.00 Inf 0.03 0.98
JobRoleTherapist 5417490.74 0.00 Inf 0.03 0.98
DistanceFromHome 1.08 1.05 1.11 5.30 0.00
YearsInCurrentRole 0.73 0.65 0.82 -5.36 0.00
YearsSinceLastPromotion 1.16 1.03 1.31 2.46 0.01
EnvironmentSatisfaction.L 0.25 0.15 0.41 -5.44 0.00
EnvironmentSatisfaction.Q 1.65 1.02 2.67 2.06 0.04
EnvironmentSatisfaction.C 0.93 0.58 1.49 -0.31 0.75
OverTimeYes 28.47 16.03 50.57 11.43 0.00
MaritalStatusMarried 1.81 0.89 3.67 1.64 0.10
MaritalStatusSingle 10.42 5.07 21.39 6.38 0.00
JobInvolvement.L 0.11 0.05 0.26 -5.05 0.00
JobInvolvement.Q 1.18 0.60 2.32 0.49 0.63
JobInvolvement.C 0.88 0.56 1.38 -0.57 0.57
JobSatisfaction.L 0.44 0.27 0.70 -3.40 0.00
JobSatisfaction.Q 0.96 0.59 1.57 -0.16 0.87
JobSatisfaction.C 0.79 0.49 1.30 -0.92 0.36
JobLevel.L 0.28 0.08 0.99 -1.98 0.05
JobLevel.Q 2.50 0.91 6.85 1.78 0.07
JobLevel.C 0.32 0.16 0.66 -3.09 0.00
Standard errors: MLE

There is little change in model performance with and without TotalWorkingYears. Given this, I will move ahead with diagnostic testing of assumptions for the newest model, reg7.

Diagnostic Testing of Assumptions Reg7

residualPlots(reg7)

##                         Test stat Pr(>|Test stat|)
## JobRole                                           
## DistanceFromHome           0.3125           0.5762
## YearsInCurrentRole         1.1908           0.2752
## YearsSinceLastPromotion    2.0170           0.1555
## EnvironmentSatisfaction                           
## OverTime                                          
## MaritalStatus                                     
## JobInvolvement                                    
## JobSatisfaction                                   
## JobLevel
marginalModelPlots(reg7)
## Warning in mmps(...): Interactions and/or factors skipped

linearity.check.new <-train.att %>%
  select(DistanceFromHome, YearsInCurrentRole, YearsSinceLastPromotion)
predictors.new <- colnames(linearity.check.new)
linearity.check.new$probabilities <- reg7$fitted.values
linearity.check.new <- linearity.check.new %>%
  mutate(logit = log(probabilities/(1 - probabilities))) %>%
  select(-probabilities) %>%
  gather(key = "predictors", value = "predictor.values", -logit)
ggplot(linearity.check.new, aes(y = logit, x = predictor.values)) +
  geom_point() +
  stat_smooth(method = "loess") +
  theme_test() +
  facet_wrap(~predictors, scales = "free_x")
## `geom_smooth()` using formula = 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at -0.075
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 2.075
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 2.4351e-15
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : There are other near singularities as well. 4
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
## -0.075
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 2.075
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 2.4351e-15
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other near
## singularities as well. 4

reg7.cont.ind <- att[c("DistanceFromHome", "YearsInCurrentRole", "YearsSinceLastPromotion")]
cor(reg7.cont.ind)
##                         DistanceFromHome YearsInCurrentRole
## DistanceFromHome              1.00000000         0.02618534
## YearsInCurrentRole            0.02618534         1.00000000
## YearsSinceLastPromotion       0.01630781         0.54189737
##                         YearsSinceLastPromotion
## DistanceFromHome                     0.01630781
## YearsInCurrentRole                   0.54189737
## YearsSinceLastPromotion              1.00000000
vif(reg7)
##                             GVIF Df GVIF^(1/(2*Df))
## JobRole                 1.452671  3        1.064211
## DistanceFromHome        1.180258  1        1.086397
## YearsInCurrentRole      1.636493  1        1.279255
## YearsSinceLastPromotion 1.810452  1        1.345530
## EnvironmentSatisfaction 1.317550  3        1.047035
## OverTime                1.540195  1        1.241046
## MaritalStatus           1.309297  2        1.069694
## JobInvolvement          1.229895  3        1.035090
## JobSatisfaction         1.162593  3        1.025427
## JobLevel                2.230672  3        1.143069
outlierTest(reg7)
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
##     rstudent unadjusted p-value Bonferroni p
## 834 3.071789          0.0021278           NA
influenceIndexPlot(reg7)

influencePlot(reg7)

##           StudRes         Hat        CookD
## 317  -1.083660122 0.218460028 1.047011e-02
## 1084 -0.001454763 0.585381756 9.600931e-08
## 1221  2.587546650 0.056571533 4.736500e-02
## 572   2.737863674 0.006265692 1.060972e-02
## 834   3.071789049 0.004706782 1.934519e-02
## 973   1.810158045 0.199348469 3.794619e-02
reg7.no.outlier <- glm(Attrition ~ JobRole+ DistanceFromHome + YearsInCurrentRole + YearsSinceLastPromotion + EnvironmentSatisfaction + OverTime +MaritalStatus + JobInvolvement + JobSatisfaction + JobLevel, data = train.att, family = "binomial")
summ(reg7.no.outlier, exp = TRUE, confint = TRUE)
Observations 1276
Dependent variable Attrition
Type Generalized linear model
Family binomial
Link logit
χ²(21) 458.45
Pseudo-R² (Cragg-Uhler) 0.58
Pseudo-R² (McFadden) 0.49
AIC 529.41
BIC 642.74
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 0.00 0.00 Inf -0.04 0.97
JobRoleNurse 15200187.77 0.00 Inf 0.03 0.98
JobRoleOther 10069415.82 0.00 Inf 0.03 0.98
JobRoleTherapist 5417490.74 0.00 Inf 0.03 0.98
DistanceFromHome 1.08 1.05 1.11 5.30 0.00
YearsInCurrentRole 0.73 0.65 0.82 -5.36 0.00
YearsSinceLastPromotion 1.16 1.03 1.31 2.46 0.01
EnvironmentSatisfaction.L 0.25 0.15 0.41 -5.44 0.00
EnvironmentSatisfaction.Q 1.65 1.02 2.67 2.06 0.04
EnvironmentSatisfaction.C 0.93 0.58 1.49 -0.31 0.75
OverTimeYes 28.47 16.03 50.57 11.43 0.00
MaritalStatusMarried 1.81 0.89 3.67 1.64 0.10
MaritalStatusSingle 10.42 5.07 21.39 6.38 0.00
JobInvolvement.L 0.11 0.05 0.26 -5.05 0.00
JobInvolvement.Q 1.18 0.60 2.32 0.49 0.63
JobInvolvement.C 0.88 0.56 1.38 -0.57 0.57
JobSatisfaction.L 0.44 0.27 0.70 -3.40 0.00
JobSatisfaction.Q 0.96 0.59 1.57 -0.16 0.87
JobSatisfaction.C 0.79 0.49 1.30 -0.92 0.36
JobLevel.L 0.28 0.08 0.99 -1.98 0.05
JobLevel.Q 2.50 0.91 6.85 1.78 0.07
JobLevel.C 0.32 0.16 0.66 -3.09 0.00
Standard errors: MLE
AIC(reg7, reg7.no.outlier)
##                 df      AIC
## reg7            22 529.4085
## reg7.no.outlier 22 529.4085
influencePlot(reg7.no.outlier)

##           StudRes         Hat        CookD
## 317  -1.083660122 0.218460028 1.047011e-02
## 1084 -0.001454763 0.585381756 9.600931e-08
## 1221  2.587546650 0.056571533 4.736500e-02
## 572   2.737863674 0.006265692 1.060972e-02
## 834   3.071789049 0.004706782 1.934519e-02
## 973   1.810158045 0.199348469 3.794619e-02

Reg7 Performance, Test Data

if (!require(caret)) install.packages("caret")
## Loading required package: caret
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(caret)
test.probabilities <- predict(reg7, test.att, type = "response")
contrasts(test.att$Attrition)
##     Yes
## No    0
## Yes   1
predicted.Attrition<- ifelse(test.probabilities > 0.5, "Yes", "No")
mean(predicted.Attrition == test.att$Attrition)
## [1] 0.8934169
table(predicted.Attrition, test.att$Attrition)
##                    
## predicted.Attrition  No Yes
##                 No  267  24
##                 Yes  10  18

Conclusions